author: "Jonathan Lau"
This report showcases the power of R for social media analysis, using Twitter in this specific case. The analysis below is only cursory and was prepared over a few days. The example code focuses on generating insight for TEDxGlasgow and the topic of climate from Twitter data. The note below provides some context for reading this report.
NB
I have applied to Twitter for a developer licence to see what data options I have as an individual. A higher tier of access would provide for more comprehensive and historical analysis.
This is an introduction to the power of Twitter data and what you can achieve using social media analysis.
The first step is accessing tweets via the Twitter API and then leveraging the power of the rtweet library. Next, we can use the extracted data for social media analysis.
The volume and velocity of tweets posted on Twitter every second are an indicator of the power of Twitter data.
The enormous amount of information available, from the tweet text and its metadata, gives great scope for analyzing extracted tweets and deriving insights.
The following code extracts a 1% random sample of live tweets using stream_tweets() over a 30-second window and saves it into a data frame.
The dimensions of the data frame indicate the number of live tweets extracted and the number of columns containing the tweet text and metadata.
# Load libraries
library(rtweet)
library(httpuv)
library(tidyverse)
# Extract live tweets for 30 seconds window
tweets30s <- stream_tweets("", timeout = 30)
## Streaming tweets for 30 seconds...
## Finished streaming tweets!
# View dimensions of the data frame with live tweets
dim(tweets30s)
## [1] 1842 90
Comment
Twitter allows the extraction of only a limited number of tweets with a free account.
Many functions are available in R to extract Twitter data for analysis.
search_tweets() is a powerful function from rtweet that extracts tweets matching a search query.
The function returns a maximum of 18,000 tweets per request.
# Extract tweets on "#TedXGlasgow" and include retweets
twts_tedx <- search_tweets("TEDxGlasgow",
                           n = 18000,
                           include_rts = TRUE,
                           lang = "en")
# View tweets
twts_tedx %>%
  relocate(text, screen_name)
# Extract tweets on "TEDxGlaClimate" and include retweets
twts_tedx_climate <- search_tweets("TEDxGlaClimate",
                                   n = 18000,
                                   include_rts = TRUE,
                                   lang = "en")
# View tweets
twts_tedx_climate %>%
  relocate(text, screen_name)
You can see various tweets posted by users.
Similar to search_tweets(), get_timeline() is another function in the rtweet library that can be used to extract tweets.
The get_timeline() function is different from search_tweets(): it extracts tweets posted by a given user to their timeline instead of searching based on a query.
The get_timeline() function can extract up to 3200 tweets at a time.
Tweets posted by TEDxGlasgow to their timeline are extracted below.
# Extract tweets posted by the user @TEDxGlasgow
get_TedX <- get_timeline("@TEDxGlasgow", n = 3200)
# View output
get_TedX
Comment
The metadata components of extracted twitter data can be analyzed to derive insights.
To identify Twitter users who are interested in a topic, you can look at users who tweet often on that topic. The insights derived can be used to promote targeted events to interested users.
The code below counts tweets per screen name in the extracted “TEDxGlasgow” timeline data.
# Create a table of users and tweet counts for the topic
sc_name <- table(get_TedX$screen_name)
# Sort the table in descending order of tweet counts
sc_name_sort <- sort(sc_name, decreasing = TRUE)
# View sorted table for top 10 users
head(sc_name_sort, 10)
## TEDxGlasgow
## 3183
Comment
The follower count of a Twitter account indicates the popularity of a personality or business entity and is a measure of influence on social media.
Knowing the follower counts helps digital marketers strategically position ads on popular Twitter accounts for increased visibility.
The following code extracts user data and compares follower counts for the Twitter accounts of popular Scottish news sites.
# Extract user data for the twitter accounts of news sites and Darin O Lien for comparison
users <- lookup_users(c("DarinOlien", "ScotEntNews", "BBCScotlandNews", "STVNews", "heraldscotland", "Scotland", "VisitScotNews", "TheScotsman", "BBCRadioScot"))
# Create a data frame of screen names and follower counts
user_df <- users[,c("screen_name","followers_count")]
# Display and compare the follower counts for the news sites
user_df
Comment
A retweet helps utilize existing content to build a following for your brand.
The number of times a tweet is retweeted indicates what is trending. These insights can be leveraged to promote your brand through the popular retweets.
The code below identifies tweets on “TEDxGlaClimate” that have been retweeted the most.
# Create a data frame of tweet text and retweet count
rtwt <- twts_tedx_climate[,c("text", "retweet_count")]
head(rtwt)
# Sort data frame based on descending order of retweet counts
rtwt_sort <- arrange(rtwt, desc(retweet_count))
# Exclude rows with duplicate text from the sorted data frame
rtwt_unique <- distinct(rtwt_sort, text, .keep_all = TRUE)
# Print top 6 unique posts retweeted most number of times
rownames(rtwt_unique) <- NULL
head(rtwt_unique)
Comment
It’s time to go deeper: applying filters to tweets, analysing Twitter user data using the golden ratio and the Twitter lists users subscribe to, extracting trending topics, and analysing Twitter data over time to identify interesting insights.
An original tweet is an original posting by a Twitter user, not a retweet, quote, or reply.
The -filter operator can be combined with a search query to exclude retweets, quotes, and replies during tweet extraction.
# Extract 5000 original tweets on "Climate"
tweets_org <- search_tweets("Climate -filter:retweets -filter:quote -filter:replies", n = 5000)
# Check for presence of replies
tweets_org %>%
  count(reply_to_screen_name)
# Check for presence of quotes
tweets_org %>%
  count(is_quote)
# Check for presence of retweets
tweets_org %>%
  count(is_retweet)
For (just shy of) the 5000 tweets, the output of NA for reply_to_screen_name and FALSE for is_quote and is_retweet confirms that the filtered tweets are original posts and not replies, quotes, or retweets.
You can use the language filter with a search query to filter tweets based on the language of the tweet.
The filter extracts tweets that have been classified by Twitter as being of a particular language.
# Extract tweets on "Climate" in French
tweets_french <- search_tweets("Climate", lang = "fr")
# Display the tweets and language metadata
tweets_french %>%
  select(text, lang)
Popular tweets are tweets that are retweeted and favourited many times.
They are useful in identifying current trends. A brand can promote its merchandise and build brand loyalty by identifying popular tweets and retweeting them.
The code below extracts tweets on “TEDx” that have been retweeted at least 100 times and favourited by at least 100 users.
# Extract tweets with a minimum of 100 retweets and 100 favorites
tweets_pop <- search_tweets("TEDx min_retweets:100 AND min_faves:100")
# Create a data frame to check retweet and favorite counts
counts <- tweets_pop[c("retweet_count", "favorite_count")]
head(counts)
# View the tweets
head(tweets_pop$text)
## [1] "Tedx ''Asla pes etme!'' temalı konuşma izledikten sonra ben\n\nhttps://t.co/12JtOVwVyp"
## [2] "Listen to #BittrexGlobal's new podcast, The Bit, hosted by @tomalbright (CEO) & @StephenStonberg (COO/CFO). \n\nOur first episode features @HenriArslanian a #TEDx speaker & thought leader in the #crypto space. #TheBitPodcast \n\nhttps://t.co/KuIHlEV9ez https://t.co/0STmAE5Ipe"
You can see that the extracted tweets received a minimum of 100 retweets and 100 favorites. You can also change the minimum value for retweets and favorites from 100 to a higher number if required.
Analyzing Twitter user data provides vital information that can be used to plan relevant promotional strategies.
User information contains data on the number of followers and friends of each Twitter user.
The user information may contain multiple instances of the same user, since a user might have tweeted multiple times on a given subject. Taking the mean of the follower and friend counts reduces these to one instance per user.
# Extract user information of people who have tweeted on TEDxGlasgow
user_cos <- users_data(twts_tedx)
# View few rows of user data
head(user_cos)
# Aggregate screen name, follower and friend counts
counts_df <- user_cos %>%
  group_by(screen_name) %>%
  summarise(follower = mean(followers_count, na.rm = TRUE),
            friend = mean(friends_count, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
# View the output
counts_df
The screen names have been tabulated with their corresponding follower and friend counts. Next, this data is used to calculate the golden ratio.
The ratio of the number of followers to the number of friends a user has is called the golden ratio.
This ratio is a useful metric for marketers to strategize promotions.
# Calculate and store the golden ratio
counts_df$ratio <- counts_df$follower/counts_df$friend
# Sort the data frame in decreasing order of follower count
counts_sort <- arrange(counts_df, desc(follower))
# View the first few rows
head(counts_sort)
# Select rows where the follower count is greater than 50000
counts_sort[counts_sort$follower>50000,]
# Select rows where the follower count is less than 1000
counts_sort[counts_sort$follower<1000,]
Users with a high follower count should also have a high ratio. These users can serve as a medium to promote a brand to a wide audience.
A Twitter list is a curated group of Twitter accounts.
Twitter users subscribe to lists that interest them. Collecting user information from Twitter lists can help brands promote products to interested customers.
The code below extracts the lists that the “TEDxGlasgow” account subscribes to.
# Loading library
library(tidyverse)
# Extract all the lists "TEDxGlasgow" subscribes to, sorted by subscriber count
lst_TEDx <- lists_users("TEDxGlasgow")
lst_TEDx %>%
  arrange(desc(subscriber_count)) %>%
  head()
# Extract subscribers of the list and view the first 4 columns
list_TED_sub <- lists_subscribers("9783131", n = 500) %>%
  arrange(followers_count)
list_TED_sub[,1:4]
# Create a vector of the first few screen names from the subscribers list
users <- list_TED_sub$screen_name %>%
  head()
# Extract user information for these screen names and view the output
users_TEDx_sub <- lookup_users(users)
users_TEDx_sub
You now have extracted user data of potential customers to whom you can promote TEDxGlasgow.
Location-specific trends identify popular topics trending in a specific location. You can extract trends at the country level or city level.
It is more meaningful to extract trends around a specific region, in order to focus on the Twitter audience in that region for targeted marketing of a brand.
What is trending in the UK?
# Get topics trending in the UK
gt_country <- get_trends("United Kingdom") %>%
  arrange(desc(tweet_volume)) %>%
  view()
Trending topics in a city provide a chance to promote region-specific events or products.
This code extracts topics that are trending in Glasgow and looks at the most tweeted trends.
Note: tweet_volume is returned for trends only if this data is available.
# Get topics trending in Glasgow
gt_city <- get_trends("Glasgow")
# View the first 6 columns
head(gt_city[,1:6])
# Aggregate the trends and tweet volumes
trend_df <- gt_city %>%
  group_by(trend) %>%
  summarise(tweet_vol = mean(tweet_volume, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
# Sort data frame on descending order of tweet volumes and print header
trend_df_sort <- arrange(trend_df, desc(tweet_vol))
head(trend_df_sort,10)
Trends can change quickly. The most-tweeted trends in Glasgow have recently been:
* ‘yeonjun’, a K-pop star
* #Gray (because of EastEnders?)
* #Fridaythoughts
Visualizing the frequency of tweets over time helps gauge the level of interest in a topic.
It would be interesting to check the interest level in ClimateCrisis by visualizing the frequency of tweets.
# Extract tweets on #ClimateCrisis and exclude retweets
ClimateCrisis_twts <- search_tweets("#ClimateCrisis", n = 18000, include_rts = FALSE)
# View the output
head(ClimateCrisis_twts)
# Create a time series plot
ts_plot(ClimateCrisis_twts, by = "hours", color = "blue")
Comment
ClimateCrisis appears to have a cyclical pattern over the days accessed, with lower peaks most recently.
A time series object contains the aggregated frequency of tweets over a specified time interval.
Creating time series objects is the first step before visualizing tweet frequencies for comparison.
This code creates time series objects for two TEDx events for comparison.
# Create a time series object for TEDxGlasgow at hourly intervals
TEDxGlasgow_ts <- ts_data(twts_tedx, by = "hours")
# Rename the two columns in the time series object
names(TEDxGlasgow_ts) <- c("time", "TEDxGlasgow_n")
# View data
TEDxGlasgow_ts
# Create a time series object for TEDxGlaClimate at hourly intervals
TEDxGlaClimate_ts <- ts_data(twts_tedx_climate, by = "hours")
# Rename the two columns in the time series object
names(TEDxGlaClimate_ts) <- c("time", "TEDxGlaClimate_n")
# View data
TEDxGlaClimate_ts
# Get TEDSummit2019 data for comparison
twts_tedsummit2019 <- search_tweets("#TEDSummit2019",
                                    n = 18000,
                                    include_rts = TRUE,
                                    lang = "en")
# Create a time series object for TEDSummit2019 at hourly intervals
tedsummit2019_ts <- ts_data(twts_tedsummit2019, by = "hours")
# Rename the two columns in the time series object
names(tedsummit2019_ts) <- c("time", "tedsummit_n")
# View data
tedsummit2019_ts
Time series objects aggregate tweet frequencies over time. They are useful for creating time series plots for comparison.
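As a minimal sketch of such a comparison (assuming the three time series objects created above), the hourly frequencies can be stacked into one long data frame and plotted together; column names are overridden here so the sketch does not depend on the earlier renames:

```r
# Label each hourly series, then stack them into one long data frame
ts_all <- bind_rows(
  TEDxGlasgow_ts    %>% set_names(c("time", "n")) %>% mutate(event = "TEDxGlasgow"),
  TEDxGlaClimate_ts %>% set_names(c("time", "n")) %>% mutate(event = "TEDxGlaClimate"),
  tedsummit2019_ts  %>% set_names(c("time", "n")) %>% mutate(event = "TEDSummit2019")
)
# Plot tweet frequency over time, one line per event
ggplot(ts_all, aes(x = time, y = n, colour = event)) +
  geom_line() +
  labs(x = "Time", y = "Tweets per hour", colour = "Event")
```

Plotting the events on one set of axes makes differences in both volume and timing immediately visible.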
A picture is worth a thousand words! The following code explores how to visualize tweet text using bar plots and word clouds. Tweet text is processed into a clean text corpus for analysis. Imagine being able to extract key discussion topics and people’s perceptions of a subject or brand from the tweets they share: this is possible using topic modeling and sentiment analysis.
Tweet text posted by Twitter users is unstructured, noisy, and raw.
It contains emoticons, URLs, and numbers. This redundant information has to be cleaned before analysis in order to yield reliable results.
The code below removes URLs from the ClimateCrisis tweets and replaces characters other than letters with spaces.
# Loading Regex library
library(qdapRegex)
# Extract tweet text from ClimateCrisis dataset
twt_txt <- ClimateCrisis_twts$text
head(twt_txt)
## [1] "@6point626 Indeed David - it's disgraceful. As a Green Party member, I can't adequately express just how much this incoherent BS from \"fellow\" (and I use this term advisedly) European Greens grieves and frustrates me. \n\n#ClimateCrisis?#WhatClimateCrisis?"
## [2] "@RBKgreens The Belgian Greens must be very proud of themselves... \U0001f612\n\n#ClimateCrisis?#WhatClimateCrisis?\n\nhttps://t.co/ui3JfjvKTz https://t.co/bxQ68aUDbc"
## [3] "Really fantastic news from the Net-Zero Asset Owner Alliance! Moving $5trillion of investments to zero emissions companies. \n\n#NetZero #ClimateAction\n#ClimateCrisis #GoodNewsFriday #SomeGoodNews #positivenews #ShareSomethingGood\n\nhttps://t.co/bQqXzqMCQJ"
## [4] "#Cities are – and should be – at the forefront of #climateaction.\n\nhttps://t.co/CKNQKlHKzO\n\nSource: @EURACTIV \n#ClimateAction #climatecrisis #greenrecovery #sustainablecities"
## [5] "\"With more #greeninfrastructure, large metropolitan areas will become cleaner, cooler, healthier, and more resilient to #climatechange.\"\n\n#greencities #greenarchitecture #sustainability #greenspaces #citiesforpeople #sustainablecities #resilientcities #climatecrisis #smartcities"
## [6] "@threesleepydogs @DoroGrelle @realJeff45 @guardian You found one 'scientist'. What about the 7 million scientists, 100s of thousands of peer reviewed studies\nand every major science organization, colleges and institutions that all agree on AGW? #ClimateAction #ClimateChange #ClimateCrisis #ClimateBrawl"
# Remove URLs from the tweet text and view the output
twt_txt_url <- rm_twitter_url(twt_txt)
head(twt_txt_url)
## [1] "@6point626 Indeed David - it's disgraceful. As a Green Party member, I can't adequately express just how much this incoherent BS from \"fellow\" (and I use this term advisedly) European Greens grieves and frustrates me. #ClimateCrisis?#WhatClimateCrisis?"
## [2] "@RBKgreens The Belgian Greens must be very proud of themselves... \U0001f612 #ClimateCrisis?#WhatClimateCrisis?"
## [3] "Really fantastic news from the Net-Zero Asset Owner Alliance! Moving $5trillion of investments to zero emissions companies. #NetZero #ClimateAction #ClimateCrisis #GoodNewsFriday #SomeGoodNews #positivenews #ShareSomethingGood"
## [4] "#Cities are – and should be – at the forefront of #climateaction. @EURACTIV #ClimateAction #climatecrisis #greenrecovery #sustainablecities"
## [5] "\"With more #greeninfrastructure, large metropolitan areas will become cleaner, cooler, healthier, and more resilient to #climatechange.\" #greencities #greenarchitecture #sustainability #greenspaces #citiesforpeople #sustainablecities #resilientcities #climatecrisis #smartcities"
## [6] "@threesleepydogs @DoroGrelle @realJeff45 @guardian You found one 'scientist'. What about the 7 million scientists, 100s of thousands of peer reviewed studies and every major science organization, colleges and institutions that all agree on AGW? #ClimateAction #ClimateChange #ClimateCrisis #ClimateBrawl"
# Replace special characters, punctuation, & numbers with spaces
twt_txt_chrs <- gsub("[^A-Za-z]"," " , twt_txt_url)
# View text after replacing special characters, punctuation, & numbers
head(twt_txt_chrs)
## [1] " point Indeed David it s disgraceful As a Green Party member I can t adequately express just how much this incoherent BS from fellow and I use this term advisedly European Greens grieves and frustrates me ClimateCrisis WhatClimateCrisis "
## [2] " RBKgreens The Belgian Greens must be very proud of themselves ClimateCrisis WhatClimateCrisis "
## [3] "Really fantastic news from the Net Zero Asset Owner Alliance Moving trillion of investments to zero emissions companies NetZero ClimateAction ClimateCrisis GoodNewsFriday SomeGoodNews positivenews ShareSomethingGood"
## [4] " Cities are and should be at the forefront of climateaction EURACTIV ClimateAction climatecrisis greenrecovery sustainablecities"
## [5] " With more greeninfrastructure large metropolitan areas will become cleaner cooler healthier and more resilient to climatechange greencities greenarchitecture sustainability greenspaces citiesforpeople sustainablecities resilientcities climatecrisis smartcities"
## [6] " threesleepydogs DoroGrelle realJeff guardian You found one scientist What about the million scientists s of thousands of peer reviewed studies and every major science organization colleges and institutions that all agree on AGW ClimateAction ClimateChange ClimateCrisis ClimateBrawl"
The URLs have been removed, and the special characters, punctuation, and numbers have been replaced with spaces in the text.
A corpus is a collection of text documents. You have to convert the tweet text into a corpus to facilitate the subsequent text-processing steps.
When analyzing text, you want to ensure that a word is not counted as two different words because of case differences, so the text is converted to lowercase.
The code below creates a text corpus and converts all characters to lowercase.
# Loading text mining library
library(tm)
# Convert the "twt_txt_chrs" text to a corpus and view the output
twt_corpus <- twt_txt_chrs %>%
  VectorSource() %>%
  Corpus()
head(twt_corpus$content)
## [1] " point Indeed David it s disgraceful As a Green Party member I can t adequately express just how much this incoherent BS from fellow and I use this term advisedly European Greens grieves and frustrates me ClimateCrisis WhatClimateCrisis "
## [2] " RBKgreens The Belgian Greens must be very proud of themselves ClimateCrisis WhatClimateCrisis "
## [3] "Really fantastic news from the Net Zero Asset Owner Alliance Moving trillion of investments to zero emissions companies NetZero ClimateAction ClimateCrisis GoodNewsFriday SomeGoodNews positivenews ShareSomethingGood"
## [4] " Cities are and should be at the forefront of climateaction EURACTIV ClimateAction climatecrisis greenrecovery sustainablecities"
## [5] " With more greeninfrastructure large metropolitan areas will become cleaner cooler healthier and more resilient to climatechange greencities greenarchitecture sustainability greenspaces citiesforpeople sustainablecities resilientcities climatecrisis smartcities"
## [6] " threesleepydogs DoroGrelle realJeff guardian You found one scientist What about the million scientists s of thousands of peer reviewed studies and every major science organization colleges and institutions that all agree on AGW ClimateAction ClimateChange ClimateCrisis ClimateBrawl"
# Convert the corpus to lowercase
twt_corpus_lwr <- tm_map(twt_corpus, tolower)
## Warning in tm_map.SimpleCorpus(twt_corpus, tolower): transformation drops
## documents
# View the corpus after converting to lowercase
head(twt_corpus_lwr$content)
## [1] " point indeed david it s disgraceful as a green party member i can t adequately express just how much this incoherent bs from fellow and i use this term advisedly european greens grieves and frustrates me climatecrisis whatclimatecrisis "
## [2] " rbkgreens the belgian greens must be very proud of themselves climatecrisis whatclimatecrisis "
## [3] "really fantastic news from the net zero asset owner alliance moving trillion of investments to zero emissions companies netzero climateaction climatecrisis goodnewsfriday somegoodnews positivenews sharesomethinggood"
## [4] " cities are and should be at the forefront of climateaction euractiv climateaction climatecrisis greenrecovery sustainablecities"
## [5] " with more greeninfrastructure large metropolitan areas will become cleaner cooler healthier and more resilient to climatechange greencities greenarchitecture sustainability greenspaces citiesforpeople sustainablecities resilientcities climatecrisis smartcities"
## [6] " threesleepydogs dorogrelle realjeff guardian you found one scientist what about the million scientists s of thousands of peer reviewed studies and every major science organization colleges and institutions that all agree on agw climateaction climatechange climatecrisis climatebrawl"
The corpus has been built from the tweet text, and its characters have been converted to lowercase.
A text corpus usually contains many common words like a, an, the, of, and but. These are called stop words.
Stop words are usually removed during text processing so the analysis can focus on the important words in the corpus.
The additional spaces created while removing special characters, punctuation, numbers, and stop words also need to be stripped from the corpus.
# Remove English stop words from the corpus and view the corpus
twt_corpus_stpwd <- tm_map(twt_corpus_lwr, removeWords, stopwords("en"))
## Warning in tm_map.SimpleCorpus(twt_corpus_lwr, removeWords, stopwords("en")):
## transformation drops documents
head(twt_corpus_stpwd$content)
## [1] " point indeed david s disgraceful green party member can t adequately express just much incoherent bs fellow use term advisedly european greens grieves frustrates climatecrisis whatclimatecrisis "
## [2] " rbkgreens belgian greens must proud climatecrisis whatclimatecrisis "
## [3] "really fantastic news net zero asset owner alliance moving trillion investments zero emissions companies netzero climateaction climatecrisis goodnewsfriday somegoodnews positivenews sharesomethinggood"
## [4] " cities forefront climateaction euractiv climateaction climatecrisis greenrecovery sustainablecities"
## [5] " greeninfrastructure large metropolitan areas will become cleaner cooler healthier resilient climatechange greencities greenarchitecture sustainability greenspaces citiesforpeople sustainablecities resilientcities climatecrisis smartcities"
## [6] " threesleepydogs dorogrelle realjeff guardian found one scientist million scientists s thousands peer reviewed studies every major science organization colleges institutions agree agw climateaction climatechange climatecrisis climatebrawl"
# Remove additional spaces from the corpus
twt_corpus_final <- tm_map(twt_corpus_stpwd, stripWhitespace)
## Warning in tm_map.SimpleCorpus(twt_corpus_stpwd, stripWhitespace):
## transformation drops documents
# View the text corpus after removing spaces
head(twt_corpus_final$content)
## [1] " point indeed david s disgraceful green party member can t adequately express just much incoherent bs fellow use term advisedly european greens grieves frustrates climatecrisis whatclimatecrisis "
## [2] " rbkgreens belgian greens must proud climatecrisis whatclimatecrisis "
## [3] "really fantastic news net zero asset owner alliance moving trillion investments zero emissions companies netzero climateaction climatecrisis goodnewsfriday somegoodnews positivenews sharesomethinggood"
## [4] " cities forefront climateaction euractiv climateaction climatecrisis greenrecovery sustainablecities"
## [5] " greeninfrastructure large metropolitan areas will become cleaner cooler healthier resilient climatechange greencities greenarchitecture sustainability greenspaces citiesforpeople sustainablecities resilientcities climatecrisis smartcities"
## [6] " threesleepydogs dorogrelle realjeff guardian found one scientist million scientists s thousands peer reviewed studies every major science organization colleges institutions agree agw climateaction climatechange climatecrisis climatebrawl"
You can see some of the common stop words and all the additional spaces removed in the output.
Popular terms in a text corpus can be visualized using bar plots or word clouds.
However, it is important to remove any remaining stop words from the corpus before using these visualization tools.
The code below checks the term frequencies and removes stop words from the text corpus created for “ClimateCrisis”.
# Loading library for text analysis
library(qdap)
# Extract term frequencies for top 60 words and view output
termfreq <- freq_terms(twt_corpus, 60)
termfreq
# A vector of custom stop words could also be defined and removed, e.g.:
# custom_stopwds <- c("s", "amp", "can", "new", "medical",
#                     "will", "via", "way", "today", "come", "t", "ways",
#                     "say", "ai", "get", "now", "the")
# Remove English stop words and create a refined corpus
corp_refined <- tm_map(twt_corpus, removeWords, stopwords("en"))
## Warning in tm_map.SimpleCorpus(twt_corpus, removeWords, stopwords("en")):
## transformation drops documents
# Extract term frequencies for the top 20 words
termfreq_clean <- freq_terms(corp_refined, 20)
termfreq_clean
You can see that the corpus now contains only the relevant and important terms after the stop words are removed. This refined corpus is used to create the visualizations below.
A bar plot is a simple yet popular data visualization tool.
It quickly summarizes categories and their values in visual form.
The code below creates a bar plot of the popular terms appearing in the text corpus.
# Extract term frequencies for the top 25 words
termfreq_25w <- freq_terms(corp_refined, 25)
termfreq_25w
# Identify terms with more than 30 counts from the top 25 list
term30 <- subset(termfreq_25w, FREQ > 30)
term30
Terms like climate, join, watch, and party are popular. Bar plots quickly summarize these popular terms in an easily interpretable form.
# Create a bar plot using terms with more than 30 counts
ggplot(term30, aes(x = reorder(WORD, -FREQ), y = FREQ)) +
  geom_bar(stat = "identity", fill = "blue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
A word cloud is an image made up of words in which the size of each word indicates its frequency.
It is an effective promotional image for marketing campaigns.
The code below will create word clouds using the words in a text corpus.
library(RColorBrewer)
library(wordcloud)
# Create a word cloud with 6 colors and a maximum of 50 words
wordcloud(corp_refined, max.words = 50,
          colors = brewer.pal(6, "Dark2"),
          scale = c(4, 1), random.order = FALSE)
Comment
You can see that popular terms like climate and wedonthavetime are in large font sizes and positioned at the center of the word cloud to highlight their relevance and importance.
The Latent Dirichlet Allocation (LDA) algorithm is used for topic modeling.
The document-term matrix and the number of topics are the inputs to the LDA() function.
The document-term matrix, or DTM, is a matrix representation of a corpus.
Creating the DTM from the text corpus is the first step towards building a topic model.
# Create a document term matrix (DTM) for ClimateCrisis
dtm_ClimateCrisis <- DocumentTermMatrix(corp_refined)
dtm_ClimateCrisis
## <<DocumentTermMatrix (documents: 9112, terms: 22742)>>
## Non-/sparse entries: 149067/207076037
## Sparsity : 100%
## Maximal term length: 72
## Weighting : term frequency (tf)
# Find the sum of word counts in each document
rowTotals <- apply(dtm_ClimateCrisis, 1, sum)
head(rowTotals)
## 1 2 3 4 5 6
## 23 8 21 8 21 27
# Select rows with a row total greater than zero
dtm_ClimateCrisis_new <- dtm_ClimateCrisis[rowTotals > 0, ]
dtm_ClimateCrisis_new
## <<DocumentTermMatrix (documents: 9103, terms: 22742)>>
## Non-/sparse entries: 149067/206871359
## Sparsity : 100%
## Maximal term length: 72
## Weighting : term frequency (tf)
Comment
You can see that the final DTM has 9103 documents and 22742 terms, after dropping nine empty documents. The code below uses this DTM to perform topic modeling.
Topic modeling is the task of automatically discovering topics from a vast amount of text.
You can create topic models from the tweet text to quickly summarize the vast information available into distinct topics and gain insights.
The code below extracts distinct topics from the ClimateCrisis tweets.
# Load libraries
library(topicmodels)
# Create a topic model with 5 topics
topicmodl_5 <- LDA(dtm_ClimateCrisis_new, k = 5)
# Select and view the top 10 terms in the topic model
top_10terms <- terms(topicmodl_5, 10)
top_10terms
## Topic 1 Topic 2 Topic 3 Topic 4
## [1,] "climatecrisis" "climatecrisis" "climatechange" "climate"
## [2,] "the" "climatechange" "amp" "climateaction"
## [3,] "climate" "climate" "climatecrisis" "climatecrisis"
## [4,] "climateemergency" "climateaction" "will" "can"
## [5,] "climatechange" "amp" "earth" "the"
## [6,] "will" "tackle" "climate" "covid"
## [7,] "can" "will" "add" "need"
## [8,] "climateaction" "nature" "climateemergency" "this"
## [9,] "amp" "grow" "covid" "time"
## [10,] "change" "just" "church" "world"
## Topic 5
## [1,] "climatecrisis"
## [2,] "climateemergency"
## [3,] "climate"
## [4,] "the"
## [5,] "climateaction"
## [6,] "will"
## [7,] "need"
## [8,] "amp"
## [9,] "now"
## [10,] "globalwarming"
Comment
By comparison, for TEDxGlaClimate, sustainability, carbon, energy, water, and global warming aren’t among the top topic terms.
Sentiment analysis is useful in social media monitoring since it gives an overview of people’s sentiments.
Climate change is a widely discussed topic, with perceptions ranging from a severe threat to nothing but a hoax.
The code below performs sentiment analysis and extracts sentiment scores for tweets on “ClimateCrisis”.
These can then be used to plot and analyze how the collective sentiment varies among people.
library(syuzhet)
# Perform sentiment analysis for tweets on `ClimateCrisis`
sa.value <- get_nrc_sentiment(ClimateCrisis_twts$text)
## Warning: `filter_()` is deprecated as of dplyr 0.7.0.
## Please use `filter()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Warning: `data_frame()` is deprecated as of tibble 1.1.0.
## Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
# View the sentiment scores
head(sa.value, 10)
Next, these scores are used to plot the sentiments and analyze the results: which sentiments are most prevalent, and how does the collective sentiment vary for ClimateCrisis?
# Calculate the sum of sentiment scores
score <- colSums(sa.value)
# Convert the sum of scores to a data frame
score_df <- data.frame(score)
# Convert row names into 'sentiment' column and combine with sentiment scores
score_df2 <- cbind(sentiment = row.names(score_df),
                   score_df, row.names = NULL)
print(score_df2)
## sentiment score
## 1 anger 3609
## 2 anticipation 5449
## 3 disgust 1856
## 4 fear 4747
## 5 joy 3767
## 6 sadness 3053
## 7 surprise 2574
## 8 trust 6571
## 9 negative 6912
## 10 positive 11063
# Plot the sentiment scores
ggplot(data = score_df2, aes(x = sentiment, y = score, fill = sentiment)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Comment
For ClimateCrisis, it is interesting to see that positive sentiments collectively outnumber the negative ones. Trust, anticipation, and fear are notable too.
Twitter users tweet, like, follow, and retweet, creating complex network structures. We can analyse these structures and visualize the relationships between individual users as a retweet network. By extracting geolocation data from the tweets, we can also display tweet locations on a map and answer questions such as which regions or countries are talking about a brand the most. Geographic data adds a new dimension to Twitter data analysis.
A retweet network is a network of Twitter users who retweet tweets posted by other users.
People who retweet on climate topics can be potential broadcasters for the messages of a climate conference.
For starters, the following code prepares the TEDxGlaClimate tweet data for creating a retweet network.
# Extract source vertex and target vertex from the tweet data frame
rtwt_df <- twts_tedx_climate[, c("screen_name", "retweet_screen_name")]
# View the data frame
head(rtwt_df)
# Remove rows with missing values
rtwt_df_new <- rtwt_df[complete.cases(rtwt_df), ]
# Create a matrix
rtwt_matrx <- as.matrix(rtwt_df_new)
head(rtwt_matrx)
## screen_name retweet_screen_name
## [1,] "BrusselsParking" "WeDontHaveTime"
## [2,] "Qinghan_Bian" "WeDontHaveTime"
## [3,] "CBeringher" "WeDontHaveTime"
## [4,] "EleniSakkoula" "zebunsia"
## [5,] "HingYM" "WeDontHaveTime"
## [6,] "HingYM" "WeDontHaveTime"
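The edge-list matrix above can then be turned into a directed graph. As a minimal sketch (assuming the igraph package is installed), each edge points from the retweeter to the author of the original tweet:

```r
# Load the network analysis library
library(igraph)

# Build a directed retweet network from the two-column edge list
nw_rtweet <- graph_from_edgelist(rtwt_matrx, directed = TRUE)

# Plot with small, unlabelled vertices so the overall
# structure of the network stays visible
plot(nw_rtweet,
     vertex.size = 4,
     vertex.label = NA,
     edge.arrow.size = 0.2)
```

Hub accounts such as WeDontHaveTime should appear as vertices with many incoming edges, identifying the key broadcasters in the network.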